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We report on early results of a numerical and statistical study of binary black hole inspirals. The 
two black holes are evolved using post-Newtonian approximations starting with initially randomly 
distributed spin vectors. We characterize certain aspects of the distribution shortly before merger. 
In particular we note the uniform distribution of black hole spin vector dot products shortly before 
merger and a high correlation between the initial and final black hole spin vector dot products in the 
equal-mass, maximally spinning case. These simulations were performed on Graphics Processing 
Units, and we demonstrate a speed-up of a factor 50 over a more conventional CPU implementation. 

PACS numbers: 



I. INTRODUCTION 

With a number of gravitational wave detectors 
(LIGO [|, Virgo 0, TAMA g, and GEO600 [|) now 
measuring signals at design sensitivity, the prospect of 
direct detection of gravitational radiation is becoming in- 
creasingly real and hence there is now a definite need to 
understand as much as possible about the most likely sig- 
nal source - the stellar- mass binary black hole (BBH) sys- 
tem. Over recent years numerical relativity has reached 
the point were accurate and reliable inspiral and merger 
calculations can be produced by a number of differ- 
ent groups ||. However, these calculations require sig- 
nificant computational effort, and the generation of a 
large-scale bank of numerical templates is currently in- 
tractable. Recent comparisons between numerical rela- 
tivity and post-Newtonian (PN) approximate waveforms 
show good agreement surprisingly close to merger (up to 
a few orbits) [1,0], corresponding to a gravitational wave 
frequency of about luqw ~ 0.1 or an orbital frequency of 
uo « 0.05. 

The main purpose of this work is to start a detailed 
study of the phase space of the PN equations of motion 
(see also Q). We randomly choose initial conditions for 
those equations corresponding to circular orbits at some 
chosen orbital frequency loq. We then integrate the PN 



equations of motion to some termination frequency ujf 
shortly before merger, but in a region where the approx- 
imation has still been validated by numerical relativity. 
There are a number of interesting physical questions one 
can address in this context. It might be that there are 
certain configurations that are preferred shortly before 
merger. If so, maybe these regions should be studied us- 
ing full numerical relativity Another motivation regards 
partial information: If one had from some other obser- 
vation (such as an electromagnetic counterpart) enough 
information about a BBH system long before merger to 
estimate a fraction of the parameters, it would be valu- 
able to be able to estimate the system's properties shortly 
before merger. For example, assume that the spin vector 
and mass of one BH have been measured, but only the 
mass of the other is known. Using an initially uniform 
distribution for the unknown spin vector an interesting 
question would be the final distribution of spin vectors 
shortly before merger. Would there be a uniform dis- 
tribution or would it be strongly peaked, and how does 
it change given changes in the known parameters? We 
start to address some of these questions in this work, but 
much remains to be addressed in future research. 

Another motivation for this work is provided by grav- 
itational recoil. Numerical relativity has been able to 
obtain estimates for the gravitational recoil in unequal- 
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mass 0, [1(1 [H| and spinning [T^, [H, [H| configurations; 
in particular, extremely high recoil values have been ob- 
tained for specific configurations Initial stud- 
ies using effective-one-body techniques have shown that 
very large kicks are unlikely to arise [l7|. Using the PN 
inspiral approach, it might be possible to predict the like- 
lihood of certain configurations shortly before merger and 
then relate this result to the recoil using one of the many 
recoil formulas available in the literature. 

To perform this study, we use the significant perfor- 
mance advantages of Graphics Processing Units (GPUs) 
to study the phase space of the BBH problem in a post- 
Newtonian (PN) setting. Each inspiral is described by 
a set of ordinary differential equations (ODEs) and the 
collection of inspirals are all decoupled. Therefore the 
computational problem is of the "embarrassingly paral- 
lel" kind, which is perfectly suited for GPUs. At this 
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where dS/dt = dSi/dt+dS 2 /dt, -f E = 0.577 ... is Euler's 
constant, and 6 = 1039/4620. The total mass is denoted 
by M = mi + m,2 and 77 = m 1 7712 /M 2 is the symmetric 
mass ratio. The magnitude of the angular momentum 



and H2 is obtained by 1 «-> 2. Note that the spin vectors 
Si are related to the spin unit vectors Si via S; = p^mf Si, 
i.e. Xi € [0, 1] is the Kerr spin parameter of BH i. The 



early stage in our studies, in which we focus on small 
portions of the full parameter space, the performance 
benefits provided by GPUs are not essential to perform 
the research. However, we anticipate that once we be- 
gin studies of larger sections of the parameter space the 
performance gains will become important. 



II. PN EQUATIONS OF MOTION 

We integrate the post-Newtonian equations (Eqs. (1)- 
(4) and (9)) from Ref. [3 (see Erratum Q3), which de- 
scribe a circular inspiral of 2 spinning BHs. The evolution 
is given by a system of coupled ODEs for the orbital fre- 
quency uj, the individual spin vectors Si for the 2 BHs, 
and the unit orbital angular momentum vector L„. 
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can be computed via |L„| = TyM 5 / 3 ^ -1 / 3 . 

The evolution of the individual spin vectors Si for the 
2 BHs is described by a precession around fii with 
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system of coupled ODEs for w, L n , and Si given mass and 
spin parameters mj, Xi are integrated from an initial fre- 
quency ujq to a final frequency tdf. Typically, we choose 
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too corresponding to an initial separation of r ~ 40M and 
u>f = 0.05, which is a conservative estimate of where the 
PN equations still hold [a, 0| . Integrating the equations 
in a range of to rather than t provides a slightly more 
gauge-invariant measure to compare different systems. 

The ODEs are integrated using the Dormand-Prince 
method. We set an error tolerance of 5 x 10~ 7 and start 
with an adaptive time step of size h = 10. 

The 2 initial spin vectors Si span a 6-dimcnsional pa- 
rameter space for each choice of parameters m., , ujq , and 
u>f. Note that the BBH problem is scale free, and we 
can therefore fix the total mass M = m± + TO2 = 1 and 
need to study only the dependence on mi. The last 2 
parameters ui and ujf are only interesting as consistency 
checks, and it is likely sufficient to analyze the situation 
for a few select choices. This still results in a challenging 
7-dimcnsional (mi {1 dof}, Si {3 dof}, and Si {3 dof}, 
dof: degrees of freedom) problem. Here we break down 
the problem into a much more manageable 4-dimcnsional 
one by fixing not just m*, but also the spin magnitudes 
Xii which then means only the unit spin vectors Si are 
chosen freely. Each unit spin vector has only 2 degrees 
of freedom (since the third is given by the normalization 
condition). 



III. NOTES ON GPU COMPUTING 

We use NVIDIA's CUDA runtime and programming 
system [2(3] to implement the PN evolution code on 
GPUs. The GPUs today provide most impressive speed- 
ups over CPUs for single-precision computations. GPUs 
are primarily designed for computer games and hence 
current and future capabilities are essentially set by that 
market. This means it is unlikely that double-precision 
will be supported at similar speed-ups as single-precision. 
While double-precision arithmetic is now supported on 
some of the high-end cards, the performance gains over 
CPU codes are not very impressive and in our opinion do 
not yet justify the effort of porting the code to the CUDA 
infrastructure if double-precision is necessary throughout 
the code. Note that some results have been obtained 
where the code has been examined carefully and only a 
few critical computations where done in double-precision, 
providing most of the single-precision speed-up. By com- 
paring single-precision and double-precision CPU results 
on a number of inspiral configurations we have verified 
single-precision is sufficient for our problem. 

Porting the code to the CUDA architecture was rela- 
tively straightforward - wc implemented the right-hand- 
side (RHS) computations of the ODEs on the GPU in 
a device kernel. The ODE integrator runs on the CPU 
and spawns kernels on the GPU, transferring the initial 
state. The full evolution is then performed on the GPU, 
including all necessary calculations of the RHS. At the 
end of the simulation, the state is transferred back to the 
CPU for output and analysis. 

GPUs demonstrate significant global memory access 



latency (for the NVIDIA SI 070, 400-600 cycles [IJ), 
and even worse access times are encountered for "non- 
coalesced" memory access. The latter is defined slig htly 
differently on different NVIDIA GPU generations [Iff, 
but for the latest generation it happens most frequently 
if multiple memory segments are accessed within a group 
of 16 threads called a half-warp. We have measured the 
non-coalesced memory stores and loads using the CUDA 
profiler and found no uncoalesced memory access, signifi- 
cantly simplifying the programming since this eliminates 
the need to avoid certain global memory access patterns 
using shared memory. 

We found error detection and result verification on the 
GPU to be critical exercises. The computations on the 
GPUs are surprisingly resilient to errors happening on 
the cards. For instance, the on-board memory is not 
error correcting and kernel failures are not caught by de- 
fault. This "error resilience" can lead to rather interest- 
ing failure modes from a scientific computing perspective. 
When too many inspirals are spawned simultaneously the 
runtime warns and terminates the program with an er- 
ror message. As the number of inspirals is reduced a 
regime of "silent" failure is entered, where the program 
runs without any indication of error or warning, but it 
produces incorrect results. This can be caught by ex- 
plicitly checking each kernel, but the runtime does not 
generate errors itself. Issues like this highlight the case 
for independent error checks for scientific computing. 

We therefore decided to write our code in a multi- 
threaded way on the CPU. We first generate the ran- 
dom initial data for our inspiral studies on the GPU 
for the maximum number of parallel inspirals that can 
run successfully. We then transfer the initial data to the 
CPU and select a random subset of typically about 1% 
of these inspirals on the CPU. While the GPU is per- 
forming all the inspirals, we also evolve the selected in- 
spirals on the CPU in double-precision in a separate CPU 
thread. After the GPU is done, we copy the data back to 
the CPU memory and compare the single-precision GPU 
data against the double-precision CPU data for the se- 
lected subset. This also validates that single-precision 
does not introduce unacceptable errors in this problem. 
We believe such cross checks are currently crucial for 
GPU computing. We want to stress however that we 
have not observed problems or errors once the initial ones 
were worked out. 



IV. PERFORMANCE RESULTS 

We now evaluate performance advantages GPUs de- 
liver in this context. We executed our code on a single 
core of the quad-core Intel Xeon E5410 CPU running at 
2.33 GHz, which is rated at around 5GFlops in double- 
precision. Note that the problem is "embarrassingly par- 
allel," so the CPU will be able to provide excellent scaling 
over the 4 cores. We integrated one of our test inspirals 
in the range [ojq = 0.004, = 0.1] 100 times serially 
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FIG. 1: Performance of the GPU card. We vary the num- 
ber of inspirals N scheduled simultaneously on the GPU. For 
fewer than 100,000 inspirals peak performance is not reached, 
as there are not enough simultaneously executing threads to 
mask memory access latency. At 100,000 inspirals saturation 
is reached as the GPU works serially through the extra inspi- 
rals. 



and found we could achieve around 0.057 inspirals per 
millisecond (ms) on a single core. 

For comparison, one of the four GPUs on our high-end 
unit (the NVIDIA Tesla S1070) is rated at 1035 GFlops in 
single-precision, delivering a theoretical performance ad- 
vantage of about a factor 200 over the single-core CPU. 
We ran our test setup, spawning a large number of simul- 
taneous inspirals in parallel. Fig. [T] shows results for the 
performance of the GPU card. As we increase the number 
of simultaneously scheduled inspirals N the speed levels 
off at about 2.7runs/ms. The GPU has 240 processors, 
which can work in parallel. Naively one would therefore 
expect the performance to improve until 240 inspirals are 
performed in parallel and after that the performance to 
level off as the processors need to work through the in- 
spirals serially. However, the runtime CUDA scheduler 
has to be able to keep these 240 processors working si- 
multaneously without waiting for memory access, which 
can be extremely slow. This is the likely reason why the 
performance still rises even after N = 10, 000, as the run- 
time has a better chance of squeezing the optimal perfor- 
mance out of the card by interleaving different inspirals 
without having to wait for costly memory transfers. We 
performed this study using block sizes of 256 threads. We 
found similar behavior with 128-thread blocks. 

Based on this data, we achieve a very impressive speed- 
up of about a factor 50. While this comparison is slightly 
unfair to the CPU because we only use a single core and 
double-precision on the CPU, there is no doubt the per- 
formance gains are very significant. Note also that for 
this test problem we integrated the exact same problem 
in all cases. This means the individual threads run in 
perfect lock-step, the best possible case for the GPU. We 
typically see differences of about 10% in the runtime of 
different inspirals, and this would result in a slight ineffi- 
ciency on the GPU as some of the threads in a block may 
finish before others. The double versus single-precision 
on the CPU has only a moderate effect, and the paral- 



lclization on the CPU will only be simple as long as the 
problem remains of the "embarrassingly parallel" type. 
We plan to extend our studies toward more dynamic ex- 
plorations of phase space requiring dynamic feedback and 
communication between individual results. This would 
then require a significant implementation effort on the 
CPU in actually parallelizing the code, which is likely to 
be more difficult than the conversion to CUDA. 



V. SOME INSPIRAL RESULTS 

In this section we highlight some of our findings from 
a completed initial study of the inspiral space. We em- 
phasize that at this point we have focused on a few select 
situations only. We plan to soon move on to study larger 
regions of the initial configuration space and use more 
advanced statistical analysis tools to try to dynamically 
identify "interesting" regions. 

For these simulations we integrate the ODEs Eq. (JTJ- 
dSJ) in the range lo q = 0.00395 -»• w f = 0.05. The ini- 
tial orbital frequency luq corresponds to a separation of 
r ss 40M, and the final uif is chosen such that the grav- 
itational wave frequency still matches numerical relativ- 
ity results. We have varied both frequencies and, except 
in one case, have not found significant qualitative dif- 
ferences in the results. We leave detailed experimenta- 
tion with different ojq and ojj as a topic of future study. 
The initial unit angular momentum is chosen as the unit 
vector in the z-direction L„ = (0, 0, 1) without loss of 
generality. 

For each choice of the black holes masses and spin mag- 
nitudes we sample in a uniform and random way their 
spin orientations. The latter corresponds to uniformly 
sampling a sphere, using the algorithm of [2l|. We use 
N = 100, 000 randomly selected initial spin orientations 
for each black hole. That is, a total of N = 10 8 spin 
configurations. 

In analyzing the data we have found the scalar prod- 
ucts between the different unit vectors a useful quantity 
to investigate. This is also motivated by the crucial role 
these scalar products play in the recoil velocity. 



A. Equal-mass, maximally spinning black holes 

We start by looking at the case of two equal-mass 
(mi = 0.5), maximally spinning (\i = 1) black holes 
with randomly oriented initial unit-vector spin configu- 
rations. For this case all the pre-factors in the ODEs 
Eq. fTJ-Eq. {3| for the spin terms are identical and the 
Y]j XiL • Sj term in the RHS for to in particular simplifies 
to a form L • S with S = Si + S2. Note however that 
there is still a term of the form Si • S2, which will change 
for the different configurations used. 

As a first measure of the spin dynamics, we look at the 
scalar product between the two unit spin vectors Si • S2 
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FIG. 2: Uniform distribution of the scalar product Si ■ S2 of 
the final configuration at uif. The plot shows the convergence 
of the Gaussian kernel estimated probability density function 
for different N towards the uniform distribution case (which 
would be at 0.5) as well as a histogram for the N = 100, 000 
case. 



as well as the scalar products with the unit angular mo- 
mentum Si ■ L n . The simulations start out from an ini- 
tially uniform distribution in each of these scalar prod- 
ucts. Fig. [2] shows a histogram of the final value of Si ■ S2 
for N = 100, 000 inspirals and iV bins = 20 bins. The fig- 
ure also shows the probability density estimator for the 
final value of Si-S2 atw/, using a Gaussian kernel for dif- 
ferent choices of the number of inspirals N. The density 
estimator is given by p(x) = l/(Nh) J^f K ((x — Xj)/h), 
where h is the bandwidth and for the Gaussian kernel: 
K((x-Xi)/h) = l/v / 2Trcxjp(-(x-x l ) 2 /2h 2 ). We con- 
struct the bandwidth h using h = l.OGcriV -1 / 5 , where a 
is the standard deviation and N is the number of inspi- 
rals [13]. As N is increased, the probability density es- 
timator goes toward a uniform distribution. This shows 
that there is no structure in the final spin scalar prod- 
uct, i.e. that it is uniformly distributed. Note that the 
steep fall-off near the boundaries is an artifact of the 
estimator, which assumes the variable is distributed in 
the real line instead of [—1, 1] and smoothes out the dis- 
continuous jump at ±1. This feature converges away in 
N. For Sj • L„ the distributions look very similar. In 
the equal-mass, maximum spin case there is no preferred 
angle between the spin vectors generated in the inspiral. 

To further validate this even distribution of final dot 
product values, we used the Kolmogorov-Smirnov test to 
measure the uniformncss of the final values. For this test, 
the BHs had equal mass and were non-maximally spin- 
ning, with Xi = 0.7. We used a large sample contain- 
ing 362,799,815 inspirals with random Sj vectors over 
the unit sphere. The K-S test returned a p- value of 
5.17 x 10~ 5 , indicating a very uniform final distribution. 
The final histogram for this large test is displayed in 
Fig. [3J using -/Vbi ns = 100 bins. It is visually apparent 
that no final dot product of the spin vectors is favored. 

While the final distribution in Si • S2 remains uni- 
form, we do find a high linear correlation between the 
initial and final values of this scalar product, as can 
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FIG. 3: Uniform distribution of the scalar product Si • S2 
of the final configuration at cof when BH masses are even 
and Xi — 0.7. The plot shows a histogram for the N = 
362, 799, 815 case. 
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FIG. 4: Final versus initial spin scalar product Si ■ S2. The 
plot shows a high correlation in the scalar product between 
initial and final scalar product for the equal mass (mi = 7712 = 
0.5), maximally spinning (xi = X2 = 1) case. In light gray 
we show the 0-contour line to make it easier to see how far 
the configurations have spread. 



be seen in Fig. [4] The latter shows a bivariate his- 
togram for (Si • S>2)i and (Si ■ S2)/. Note that the 
histogram is normalized such that the maximum value 
is 1. We have also plotted the 0-contour line in light 
gray to make it easier to see how far configurations have 
spread. We have used iVhist = 79 bins in each dimen- 
sion, which was found by following the procedure in 
Ref. [23I ] and forcing equal numbers of bins in both di- 
mensions. This linear correlation can be quantified using 
the Pearson product-moment coefficient which is given 
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FIG. 5: Final versus initial scalar product Si -L„ for the same 
case as in Fig. [4] There is much less correlation between these 
initial and final scalar products compared to Si • S2. 



where 



by r = 1/(N-1) Ef=i (0* ~ ((». 
x,?/ are the sample means, cr x ,a y are the standard devi- 
ations, and iV is the number of inspirals. For the case of 
Fig. |4]we find r = 0.99. Note that the correlations for 
Si • L„ and S2 • L n are not nearly as high, with r = 0.65 
for both cases. In Fig. [5] we show a similar histogram 
plot for Si ■ L n . The case for S2 ■ L n looks similar. The 
lower correlation is clearly visible compared to Fig. 0] as 
the values are much further spread from the diagonal in 
Fig. El 
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FIG. 6: The correlation coefficient r of (Si ■ 82)4 and (Si • S2) / 
for maximally spinning black holes as the mass of the bodies 
is changed. The correlation falls quickly as the masses become 
uneven. 
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FIG. 7: The correlation coefficient r between the initial and 
final values of Si ■ S 2 and Sj ■ L n (j — 1,2). The masses 
are kept constant and equal (mi = mi = 0.5) while the spin 
parameters are changed at the same rate, xi = X2 = X- The 
correlation remains nearly constant and large for the Si ■ S2 
scalar product, but varies for Sj • L n . 



B. Unequal-mass, maximally spinning black holes 

Having found the high correlation between the initial 
and final scalar products of spin vectors mentioned above 
we now change the mass m\ of one BH while keeping xi = 
\2 = 1 and study the resulting change in correlation. In 
Fig. [5] we show the correlation coefficient r of the initial 
and final values of Si • S2 as the mass mi changes. Recall 
that we normalized the masses such that M = mi+m2 = 
1, and hence mi G [0,1]. Fig.[H]shows that the correlation 
drops quickly as the masses become uneven. This would 
indicate it would be harder to predict in a probabilistic 
sense the final value of this scalar product for a of a BBH 
inspiral in which the BHs have dissimilar masses and high 
spins. 



C. Equal-mass, equal-spin, non-maximally spinning 

Next we vary the spin x for each black hole while keep- 
ing the same rate (xi = X2), and fixing the masses to be 
even (mi = m,2 = 0.5) to see if there is a similar behav- 



ior in the correlation between (Si • S2)i and (Si • S2) / as 
we found when changing mi. A similar behavior could 
be expected as the mass and spin parameters enter simi- 
larly in the spin evolution equations. However, as Fig. [7] 
shows, the correlation coefficient r of the initial and fi- 
nal values of Si • S2 remains constantly large (r > 0.98), 
unlike the case of changing mi above. In Fig. [7] we also 
show the correlation coefficient r for S^ • L„ (j = 1,2), 
which shows significant variation. In the case of Sj ■ L n 
the behavior is similar for each j , which is expected since 
both BHs enter symmetrically into the expressions. The 
only difference between the BHs is their initial location. 

D. mi = 0.4, equal-spin \ ~ 0-05 

As an example of the rich structure that can be found 
if one leaves the equal mass case, we show in Fig. [5] a plot 
of the final versus initial spin scalar product Si -Sb for the 
case of mi = 0.4, and xi — X2 = X — 0.05. In this case 
a ring-like structure appears in the scalar product and 
the data is now anti-correlated with r = —0.14. Though 
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FIG. 8: Final versus initial scalar product of spin vectors 
Si • S2 for mi = 0.4 and equal-spin \ = 0.05 as an example of 
the richer structure away from highly symmetric cases. Note 
that the correlation is now negative. 



inspiral of 2 black holes. Using the inherent parallelism 
of the problem of the inspiral space study of this sys- 
tem, we chose to implement this on Graphics Processors 
well adapted to the task. We achieved a speed-up of 50 x 
compared to a single-core CPU. This speed-up will be 
important for more advanced studies of large numbers of 
inspirals planned for the future. 

First results from initial studies indicate there is a rich 
structure for certain regions of initial inspiral conditions. 
For an example see Fig. [SJ In particular, the equal-mass 
case appears special, leading to more correlated dynamics 
compared to the equal-spin case. We used scalar prod- 
ucts between the unit vectors to judge the dynamics. 

For the future we plan to significantly expand our stud- 
ies of the initial configuration space and to use more ad- 
vanced statistical methods for analysis. 



in other investigations presented in this paper the results 
were generally insensitive to the values of loq and u>f, 
we found in this case that the structure of the bivariate 
histogram could vary significantly based on these values. 
We leave a more full investigation of these cases to future 
work. 



VI. DISCUSSION 

We have described our implementation of parallel in- 
tegrations of the PN equations describing the circular 
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